# Image-Text Retrieval

**Siglip2 So400m Patch16 Naflex** (google) · Apache-2.0 · Text-to-Image, Transformers · 159.81k downloads · 21 likes
SigLIP 2 is an improved model based on the SigLIP pre-training objective, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Base Patch16 Naflex** (google) · Apache-2.0 · Text-to-Image, Transformers · 10.68k downloads · 5 likes
SigLIP 2 is a multilingual vision-language encoder that integrates SigLIP's pretraining objectives and introduces new training schemes, enhancing semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 So400m Patch16 512** (google) · Apache-2.0 · Text-to-Image, Transformers · 46.46k downloads · 18 likes
SigLIP 2 is a vision-language model based on SigLIP, with improved semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 So400m Patch16 384** (google) · Apache-2.0 · Text-to-Image, Transformers · 7,632 downloads · 2 likes
SigLIP 2 is an improved model based on the SigLIP pre-training objective, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 So400m Patch16 256** (google) · Apache-2.0 · Text-to-Image, Transformers · 2,729 downloads · 0 likes
SigLIP 2 is an improved model based on SigLIP, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Giant Opt Patch16 384** (google) · Apache-2.0 · Text-to-Image, Transformers · 26.12k downloads · 14 likes
SigLIP 2 is an improved model based on the SigLIP pre-training objective, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Giant Opt Patch16 256** (google) · Apache-2.0 · Text-to-Image, Transformers · 3,936 downloads · 1 like
SigLIP 2 is an advanced vision-language model that combines several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Large Patch16 384** (google) · Apache-2.0 · Text-to-Image, Transformers · 6,525 downloads · 2 likes
SigLIP 2 is an improved multilingual vision-language encoder based on SigLIP, enhancing semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 Large Patch16 256** (google) · Apache-2.0 · Text-to-Image, Transformers · 10.89k downloads · 3 likes
SigLIP 2 is an improved vision-language model based on SigLIP, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Base Patch16 512** (google) · Apache-2.0 · Text-to-Image, Transformers · 28.01k downloads · 10 likes
SigLIP 2 is a vision-language model that combines several training techniques to enhance semantic understanding, localization, and dense feature extraction.

**Siglip2 Base Patch16 384** (google) · Apache-2.0 · Image-to-Text, Transformers · 4,832 downloads · 5 likes
SigLIP 2 is a vision-language model based on SigLIP, enhancing semantic understanding, localization, and dense feature extraction through a unified training approach.

**Siglip2 Base Patch16 256** (google) · Apache-2.0 · Image-to-Text, Transformers · 45.24k downloads · 4 likes
SigLIP 2 is a multilingual vision-language encoder with improved semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 Base Patch16 224** (google) · Apache-2.0 · Image-to-Text, Transformers · 44.75k downloads · 38 likes
SigLIP 2 is an improved multilingual vision-language encoder based on SigLIP, enhancing semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 Base Patch32 256** (google) · Apache-2.0 · Text-to-Image, Transformers · 9,419 downloads · 4 likes
SigLIP 2 is an improved version of SigLIP, combining several training techniques to enhance semantic understanding, localization, and dense feature extraction.

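The SigLIP 2 checkpoints listed above are dual encoders, so image-text retrieval reduces to embedding the images and texts and scoring each pair. Below is a minimal sketch using Hugging Face `transformers`, assuming the base 224px variant is published on the Hub as `google/siglip2-base-patch16-224`; the other fixed-resolution checkpoints follow the same pattern, while the NaFlex variants additionally accept variable resolutions and aspect ratios through their processor.

```python
import torch
import requests
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"  # assumed Hub id; swap in another SigLIP 2 checkpoint as needed
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["two cats lying on a couch", "a dog running on a beach"]

# SigLIP expects text padded to a fixed maximum length.
inputs = processor(text=texts, images=image,
                   padding="max_length", max_length=64, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP is trained with a sigmoid loss, so each image-text pair gets an
# independent match probability rather than a softmax over candidates.
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)  # shape: (num_images, num_texts)
```
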
**Tic CLIP Basic Oracle** (apple) · Other license · Text-to-Image · 37 downloads · 0 likes
TiC-CLIP is an improved vision-language model based on OpenCLIP, focused on time-continual learning, with training data spanning 2014 to 2022.

**Clip Japanese Base** (line-corporation) · Apache-2.0 · Text-to-Image, Transformers, Japanese · 14.31k downloads · 22 likes
A Japanese CLIP model developed by LY Corporation, trained on approximately 1 billion web-collected image-text pairs, suitable for various vision tasks.

**Japanese Clip Vit B 32 Roberta Base** (recruit-jp) · Text-to-Image, Transformers, Japanese · 384 downloads · 9 likes
A Japanese version of the CLIP model that maps Japanese text and images into the same embedding space, suitable for zero-shot image classification, text-image retrieval, and other tasks.

**Align Base** (kakaobrain) · Multimodal Alignment, Transformers, English · 78.28k downloads · 25 likes
ALIGN is a vision-language dual-encoder model that aligns image and text representations through contrastive learning, achieving state-of-the-art cross-modal representations from large-scale noisy data.

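ALIGN is likewise a dual encoder with native `transformers` support (`AlignModel` / `AlignProcessor`). A minimal retrieval sketch, assuming the kakaobrain release is available on the Hub as `kakaobrain/align-base`:

```python
import torch
import requests
from PIL import Image
from transformers import AlignModel, AlignProcessor

ckpt = "kakaobrain/align-base"  # assumed Hub id
model = AlignModel.from_pretrained(ckpt)
processor = AlignProcessor.from_pretrained(ckpt)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["a photo of two cats", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# ALIGN is trained with a contrastive (softmax) loss, so similarities are
# typically normalized across the candidate texts.
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)  # shape: (num_images, num_texts)
```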